Kernel method for nonlinear Granger causality 
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Important information on the structure of complex systems, consisting of more than one compo- 
nent, can be obtained by measuring to which extent the individual components exchange information 
among each other. Such knowledge is needed to reach a deeper comprehension of phenomena rang- 
ing from turbulent fluids to neural networks, as well as complex physiological signals. The linear 
Granger approach, to detect cause-effect relationships between time series, has emerged in recent 
years as a leading statistical technique to accomplish this task. Here we generalize Granger causality 
to the nonlinear case using the theory of reproducing kernel Hilbert spaces. Our method performs 
linear Granger causality in the feature space of suitable kernel functions, assuming arbitrary de- 
gree of nonlinearity. We develop a new strategy to cope with the problem of overfitting, based 
on the geometry of reproducing kernel Hilbert spaces. Applications to coupled chaotic maps and 
physiological data sets are presented. 

PACS numbers: 05.10.-a,05.45.Tp,87.10.+e,89.70.-|-c 



Experiments in many fields of science provide time se- 
ries of simultaneously recorded variables. The analysis 
of the synchronization betvi^een time series [l| is an im- 
portant tool to study communications betvi^een different 
components of a complex systems. In many systems, 
however, it is important not only to detect synchronized 
states, but also to identify cause-effect (drive-response) 
relationships between components |j, |3|. Information- 
theoretic approaches to causality are based on the esti- 
mation of entropy and mutual information [J, |5|, |6| , an 
hard numerical problem when conditioning with respect 
to a large number of variables is to be done. Another ma- 
jor approach to analyze causality between two time series 
has been proposed by Granger [3] : if the prediction error 
of the first time series is reduced by including measure- 
ments from the second one in the linear regression model, 
then the second time series is said to have a causal in- 
fluence on the first one. This linear frame for measuring 
causality has been widely applied in many fields, includ- 
ing rheochaosjS] neurophysiology [9|] , economy [10] , and 
climatology [ll| . The importance of Granger causality is 
the suggestion to use prediction (and tools from learning 
theory [12] in particular) to measure the amount of infor- 
mation exchanged by two (sub)systems; it is worth men- 
tioning that also a measure of self-organization, rooted 
on optimal predictors, has been recently proposed p^ . 
Some attempts to extend Granger causality to the non- 
linear case have been recently proposed [14| . The main 
problem of all approaches is detection of false causalities 
[15J . which may arise due to over-fitting of the learning 
scheme. 

The purpose of this work is to present a novel ap- 
proach which measures Granger causality of time series, 
assuming arbitrary degree of nonlinearity, while control- 
ling overfitting, and thus avoiding the problem of false 
causalities. To this aim we exploit the properties of ker- 
nel machines, the state-of-the-art in learning models [iq . 



We start describing the connection between Granger 
causality and information-theoretic approaches like the 
transfer entropy Te in [^. Let {£,n}n=i,.,N+m be a 
time series that may be approximated by a stationary 
Markov process of order m, i.e. p(^„|^„_i, . . . ,^„_„i) — 
p(Cn|Cn-i7 • ■ • jCn-m-i)- Wc wiU use the shorthand no- 
tation Xi = {ii,---,^i+7n-iV and Xi = ^i+,„, for i = 
I, . . . , N , and treat these quantities as N realizations of 
the stochastic variables X and x. The minimizer of the 
risk functional, R[f] — J dXdx {x — f{X)) p{X, x), rep- 
resents the best estimate of x, given X, and corresponds 
[rn | to the regression function f*{X) = J dxp{x\X)x. 
Now, let {r]n}n=i,.,N+m bc another time series of si- 
multaneously acquired quantities, and denote Yi = 
{rji, . . . , rji+m-i)^ ■ The best estimate of x, given X and 
Y, is now: g*{X,Y) = J dxp{x\X,Y)x. If the general- 
ized Markov property holds, i.e. 



p{x\X,Y)^p{x\X), 



(1) 



then f*{X) — g*{X, Y) and the knowledge of Y does not 
improve the prediction of x. Te [4| is a measure of the 
violation of H]) : it follows that Granger causality implies 
non-zero transfer entropy. 

Due to the finiteness of iV, the risk functional cannot 
be evaluated; we consider the empirical risk ER [/] = 
X]i=i {^j ~ fi-^i)) I and the search for the minimum of 
ER is constrained in a suitable functional space, called 
hypothesis space; the simplest choice is the space of 
all linear functions, corresponding to linear regression. 
In the following we propose a geometrical description 
of linear Granger causality. For each a S {!,..., m}, 
the samples of the a-th component of X form a vec- 
tor Uq G 5R^; without loss of generality we assume that 
each Ua has zero mean and that x=(2:i, . . . , xat)^ is nor- 
malized and zero mean. We denote Xi the value of the 
linear regression of x versus X, evaluated at Xi. The 
vector x=(xi, . . . , x~n)'^ can be obtained as follows. Let 



H C 5?^ be the span of Ui, . . . , u^,; then x is the projec- 
tion of X on H . In other words, caUing P the projector on 
the space H , we have x = Px. Moreover, the prediction 
error, given X, is e^; = ||x — x|p = 1 — x^x. CaUing X 
the mx N matrix having vectors u^ as rows, H coincides 
with the range of the N x N matrix K = X^X. 

Using both X and Y, the values of the hnear regression 
form the vector x' = P'x, P' being the projector on 
the space H' C ^^ ^ spanned by the Ui, . . . , u^ and the 
components of 1" Vi , . . . , v^ (assumed to be zero mean) . 
H' is the range of the matrix K' = Z Z, where Z is the 
2m X N matrix with vectors u^ and Vq as rows. The 
prediction error is now e^y — ||x — x'|p = 1 — x'^x'. 
We now note that H C H', hence H' = H ® H^. The 
last formula shows geometrically the enlargement of the 
hypothesis space, due to the inclusion of the Y variables. 
Calling P^ the projector on H^, we have: exy = tx — 
||P-^x|p, and the linear Granger causality index reads: 



(5{r^X) 



||F^x||^ 

l-X^x" 



(2) 



Linear Granger causality is usually assessed according to 
well known test statistics, see e.g. 01 ■ Instead of assess- 
ing the presence (or not) of causality by means of a single 
statistical test, and in view of the non-linear extension, 
we introduce a causality index which by construction is 
not affected by over-fitting. We observe that H^ is the 
range of the matrix K = K'-PK'-K'P-fPK'P. Hence 
the natural choice of the orthonormal basis in H-^ is the 
set of the eigenvectors, with non vanishing eigenvalue, 
of K. Calling ti,...,tm these eigenvectors, we have: 
||P^x|p = J27Li ''"ii where r^ is the Pearson's correlation 
coefficient of x and t,. To avoid false causalities, we first 
evaluate, by Student's t test, the probability tt^ that r^ is 
due to chance, assuming x and ti normal. Since we are 
dealing with multiple comparison, we use the Bonferroni 
correction to select the eigenvectors tj/, correlated with 
X, with expected fraction of false positive equal to 0.05. 
Then we calculate a new causality index by summing 
only over the {r;'} which pass the Bonferroni test, thus 
obtaining what we call filtered linear Granger causality 
index: 



Sf (Y ^ X) 



E.^ 



1 



X ' X 



(3) 



Exchanging the roles of the two time series, we may 
evaluate the causality index in the opposite direction 
Sf{X^Y). 

The formulation of linear Granger causality, above de- 
scribed, allows an efficient generalization to the nonlin- 
ear case using methods of the theory of Reproducing 
Kernel Hilbert Spaces (RKHS) [11]. Let us first deal 
with the problem of predicting x using the knowledge of 
X. Given a kernel function k, with spectral represen- 
tation k{X,X') = J2a^a'^a{X)'^a{X'), wc cousider H, 
the range of the N x N Gram matrix K with elements 
Kij = k{Xi,Xj). As in the linear case, we calculate x. 



the projection of x onto H. Due to the spectral repre- 
sentation of fc, X coincides with the linear regression of x 
in the feature space spanned by i/A^^q, the eigenfunc- 
tions of fc; the regression is nonlinear in the original vari- 
ables. We remark that H corresponds to the functional 
space where well known methods, like Support Vector 
Machines and Kernel Ridge Regression, search for the 
regressor flGj . 

While using both X and Y to predict x, we append X 
and Y variables to construct the Z variable with sam- 
ples Zi — (XiYi)^ ; then we evaluate the Gram ma- 
trix K' with elements K'^, — k(Zi,Zj). The regres- 
sion values now form vector x' equal to the projection 
of X on H', the range of K'. In this work we con- 
sider two choices of the kernel (see the discussion in 
[l8|): the inhomogeneous polynomial (IP) of integer or- 
der p: kp{X,X') ^ {l + X^X'Y , and the Gaussian: 

kcr{X,X') ~ cxp I — -5^ 2J3 J, whose complex- 
ity depends on the scale parameter a. 

First we consider the IP kernel. In this case the eigen- 
functions ^a are all the monomials, in the input vari- 
ables, up to the p — th degree. In this case H C H' , and 
we can proceed as in the linear case, decomposing H' = 
H®H^ and calculating K = K' - PK' - K'P + PK'P. 
Along the same lines as those described in the linear case, 
we may construct the filtered Granger causality taking 
into account only the eigenvectors of K which pass the 
Bonferroni test. We discuss some examples of applica- 
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FIG. 1: The filtered causality index, for the coupled maps, is 
displayed versus e for three values of s. The inhomogeneous 
polynomial kernel with p = 2 is used, and m = 1. 

tion of our method with IP kernel. First we consider two 
unidirectionally coupled noisy logistic maps: 



Vn+i = (1 - e)(l - ayl) + e(l - axl) + S7„; 



(4) 



{t} and {7} are unit variance Gaussianly distributed 
noise terms (the parameter s determines their relevance), 
a = 1.8 and e G [0, 1] represents the coupling x ^ y. In 
the noise- free case (s = 0), a transition to complete syn- 
chronization [l| occurs at e = 0.37. Varying e and s, we 



p 


1 


2 


3 


4 


5 


Sf 


0.04 


0.88 


0.81 


0.66 


0.41 



TABLE I: Causalities x ^ y for coupled Henon's maps. 



have considered runs of N iterations, after a transient of 
10'^, and evaluated Sp, using the kernel with p = 2, in 
both directions. We find that 5f {Y ^ X) is zero for all 
values of e, s and N. On the other hand Sp (X -^ Y) 
is zero at e smaller than a threshold Cc, see figure 1. 
Sp {X —^ Y) is zero also at complete synchronization, as 
there is no information transfer in this regime. As noted 
in [1^1 , the causal relation can be inferred only when the 
coupling is not large enough to let full synchronization 
emerge, or when the synchronized state is frequently per- 
turbed by internal or external noise driving the system 
out of the synchronized state. Indeed, at fixed e > 0.37, 
6p {X — > Y) is zero until s reaches a threshold Sc- Both 
Cc and Sc scale as N~°-^, as expected, see figure 2. 



synchronization of neurons activity in the neighborhood 
of the electrode at which the signal was recorded. In 
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FIG. 3: The filtered causality indexes, for the rat EEG signals 
before (top) and after the lesion (bottom), is displayed versus 
p, the order of the inhomogeneous polynomial kernel. 
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FIG. 2: Scaling of the critical values of e and s (see the text) 
with A''. Be also has a weak dependence on s. 

As another simulated example, we consider two unidi- 
rectionally coupled Henon maps: 



yn = 1.4 + 0.3y„_i - 0.7yl_i - 0.3a;„_iy„_i; y„ 



yn-i; 



and analyze the causality between time series {x} and 
{y}; by construction, x is driving y. Using m — 2 and 
IP kernel with various values of p, on runs of length N — 
1000, we correctly find that the causality y ^ x is always 
zero whilst the causality x — > y is non zero and maximal 
at p = 2, the interaction being quadratic (Table I). 



A real example consists in rat EEG signals from right (R) 
and left (L) cortical intracranial electrodes, employed in 
the study of the pathophysiology of epilepsy and already 
analyzed in [l9|. We analyze both the normal EEG sig- 
nals and the EEG signals from the same rat after uni- 
lateral lesion in the rostral pole of the reticular thalamic 
nucleus, when spike discharges are observed due to local 



figure 3 top, the indexes Sp, for the normal EEG sig- 
nals of the rat, are depicted for p — 1,2,3,4. We find 
zero causality in the direction L— s-R for all p, and a small 
causality R^L only aX p — 1. After the unilateral lesion 
(figure 3 bottom) we find that causality R^L is almost 
unchanged, whilst a relevant L^R causality now appears 
at p = 1 and (smaller) at p = 2. A more conservative 
statistical procedure, in situations where the value oi p is 
not known a priori, is to apply, at each p, the Bonferroni's 
correction corresponding to the total number of compar- 
isons, in this case 91 (=2-f9-|-25-f55); using this correc- 
tion, the causality R— s-L becomes zero in both cases and 
for all p, whilst the causality L— >R remains unchanged 
and equal to the values depicted in figure 3. The results 
reported in [l^ are qualitatively consistent with our find- 
ings, indeed the same directions of asymmetry are found 
in the two analyses, but our approach allows to make 
more sharp and precise statements about the causality 
relationships between the two EEG signals: the only sta- 
tistically robust causality relationship is L^R after the 
lesion. Moreover, as the maximum of Sp{L — > R) occurs 
at p — 1 , our analysis seems to suggest that in this exper- 
iment the information transfer mechanism is essentially 
linear [20]. 

Turning to consider the Gaussian kernel, the condition 
H C H' does not necessarily hold and some differences in 
the approach are in order. In this case we call H the span 
of the eigenvectors of K whose eigenvalue is not smaller 
than ^Xmax, where Xmax is the largest eigenvalue of K 
and ^ is a small number (we use ^ — 10^^). We calculate 
X = Px, where P is the projector on H . After evaluating 
the Gram matrix K', the following matrix is considered: 



K* 






(5) 



where {w} are the eigenvectors of K', and the sum 
is over the eigenvalues {pi\ not smaller than /i times 




log(r) 



FIG. 4: (Top) The filtered causality indexes, for the physionet 
data-set, are displayed versus a, the width of the Gaussian 
kernel. (Bottom) Transfer entropies versus r, the length scale. 



the largest eigenvalue of K'. Then we evaluate K — 
K* - PK* - K*P + PK*P, and denote P^ the projec- 
tor onto the range of K. The filtered Granger causality 
index for Gaussian kernels is then constructed as in the 
previous cases. As another real example, we consider 
time series of heart rate (H) and breath rate (B) of a 
sleeping human suffering from sleep apnea (ten minutes 
from data set B of the Santa Fe Institute time series con- 
test held in 1991, available in the Physionet data bank 
|2l|). Using IP kernels, we find unidirectional causality 
H^B; its strength increases with the order p of the ker- 
nel, from 5f = 0.01 at p = 1 to Sp = 0.03 at p = 5. 
These findings confirm the strongly nonlinear nature of 
the interaction between heart and respiration signals in 
sleep apnea syndrome Q, which is evident also using 
the Gaussian kernel and varying a, as depicted in fig- 
ure 4 top. No causality B^H is found to be significa- 



tive, whilst non zero causality H^B is found for cr > 1. 
Note that the causality index vanishes, by construction, 
at small a and at large a, because in both limits the ker- 
nel matrix tends to be constant (0 and 1, respectively). 
In figure 4 bottom the bivariate time series is analyzed 
by means of the transfer entropy 4] . It is interesting to 
compare the two approaches in this application. Te is 
nonzero in both directions and shows a slightly stronger 
flow of information H— >B. Our approach recognizes, as 
significative, only the causality H— >B, thus revealing uni- 
directional drive-response relationship in the sleep apnea 
pathology. 

In conclusion, we considered the problem of nonlin- 
ear coherence of signals, in particular the detection of 
drive-response relationships. Exploiting the geometry of 
reproducing kernel Hilbert spaces we have introduced a 
filtered index which is able to measure cause-effect rela- 
tionships with arbitrary amount of nonlinearity, and is 
not affected by over-fitting. The choice of the optimal 
value of m can be done using the standard cross valida- 
tion scheme [la] or the embedding dimension [23|. Our 
method is equivalent to perform linear Granger causal- 
ity in the feature space of the kernel, hence also in the 
nonlinear case our approach continues to fulfill the good 
properties of linear models. The framework of Granger 
causality assumes stationarity of signals: further work 
should deal with the effects of non-stationarities on non- 
linear estimates of causalities (see [2J| for a promising 
strategy in the linear case). 

We expect that the proposed method will provide a sta- 
tistically robust basis to assess nonlinear drive-response 
relationships in many fields of science, wherever collected 
data form time series; it works for deterministic and 
stochastic systems, provided that noise is not so high 
to obscure the deterministic effects. 
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